Exploring the Arts Council National Portfolio 2018-2022

On 27 June 2017, the Arts Council England announced the National Portfolio Organisations for the next four years. They also provided a factsheet and a dataset!

Now we've explored and had a stab at tidying the dataset, it's time to revisit it systematically.

Author: Edafe Onerhime

Created: 2018-08-04

Last Updated: 2018-08-05

Description: Systematic exploration of the original dataset and the tidy datasets: funding & portfolio.

Contents: Notebook folder with this Jupyter Notebook, data folder, research folder (not sync'd to GitHub)

Notes:

2018-08-05 - Changed direction with this notebook. Initially it was a loose exploration tool, now I'm using it to explore processes and artefacts that speed up systematic data analysis. This notebook will chop and change frequently!


In [1]:
%matplotlib inline

import os.path
import pandas as pd
import numpy as np
import missingno as msno

from IPython.core.display import display, HTML
from urllib.request import urlretrieve
from tabulate import tabulate

In [2]:
"""
    Making 'global' variables we'll re-use later easier to set
"""
originalFile = {'fname': 'NPO_successful_20182022_July_6.xlsx',
                'url': 'http://www.artscouncil.org.uk/sites/default/files/download-file/NPO_successful_20182022_July_6.xlsx'}

dataFolderOut = os.path.join(os.pardir,'data','output')
dataFolderIn = os.path.join(os.pardir,'data','input')

In [3]:
"""
    All the functions we'll need
"""

def customInfo(dfIn):
    """
        Summary text and table based on Dataframe.info()
    """
    
    df = dfIn.copy(deep=True)
    df.df_name = dfIn.df_name
    
    CI = 'The dataset ' + "'" + df.df_name + "'" + \
         ' has ' + str(df.shape[1]) + ' columns and ' + str(df.shape[0]) + ' rows. Columns: '
    s = df.get_dtype_counts()
    for ind, val in s.iteritems():
        CI = CI + str(val) + ' ' + ind +', '            
    CI = CI[:-2]+'.'

    dfCI = pd.DataFrame()
    df.reset_index(inplace=True)
    dfCI['count'] = df.count()
    dfCI['unique'] = df.T.apply(lambda x: x.nunique(), axis=1)
    dfCI['min'] = df.min()
    dfCI['max'] = df.max()
    dfCI['median'] = df.median()
    dfCI['range'] = dfCI['max'].apply(pd.to_numeric, errors='coerce') - dfCI['min'].apply(pd.to_numeric, errors='coerce')
    try:
        dfCI['iqr'] = df.quantile(0.75) - df.quantile(0.25)    
    except:
        dfCI['iqr'] = np.nan
    dfCI.sort_index(inplace=True)
    return dfCI, CI

def portfolioTransform(dfIn, ColumnName, Year, Period, Type):
    """ 
        Returns a dataframe with Portfolio Funding amount, Year e.g. 2015 and Period e.g. 2015-2018
    """
    df = dfIn[[ColumnName]].copy(deep=True)
    df['Year'] = Year
    df['Period'] = Period
    df['Type'] = Type
    df.rename(columns={ColumnName:'Amount'}, inplace=True)
    return df

def describeCategory(dfIn, measure=''):
    """
        Returns an array of dataframes showing count and proportion of each category
        (1 dataframe per category column)
        When a measure is provided, use the measure instead of the count
    """
    
    df = dfIn.copy(deep=True).reset_index()
    dataframes = []
    i = 0
    
    if len(measure)==0:
        for catCol in list(df.select_dtypes(include=['category']).columns):
            dataframes.append(pd.DataFrame(df[catCol].value_counts()).reset_index().rename(columns={catCol: 'Count', 'index': catCol}))
            dataframes[i]['%'] = dataframes[i][['Count']]/df.shape[0]
            dataframes[i].df_name = catCol
            i+= 1
    else:
        for catCol in list(df.select_dtypes(include=['category']).columns):
            dataframes.append(pd.DataFrame(df.reset_index().groupby(catCol)[measure].agg('sum').reset_index()))
            dataframes[i]['%'] = dataframes[i][[measure]]/df[measure].sum() 
            dataframes[i].df_name = catCol
            i+= 1
    return dataframes

In [4]:
"""
    Load datasets we'll use later
"""

# Fetch/Load Original
fname = os.path.join(dataFolderIn,originalFile['fname'])
if not os.path.isfile(fname):
    print('Downloading:',fname)
    url = originalFile['url']
    urlretrieve(url, fname)

df = pd.read_excel(open(fname,'rb'), sheetname=0)
df.df_name = 'Original'

"""
    Get summary information
"""
dfInfo, infoText = customInfo(df)

print(infoText)
dfInfo


The dataset 'Original' has 22 columns and 844 rows. Columns: 4 float64, 5 int64, 13 object.
Out[4]:
count unique min max median range iqr
% Cash change between 17/18 and 18/19 844 252 NaN NaN NaN NaN NaN
ACE Region 844 9 East Yorkshire NaN NaN NaN
Alternative Name 111 108 NaN NaN NaN NaN NaN
Applicant Name 844 831 NaN NaN NaN NaN NaN
Area 844 6 London South West NaN NaN NaN
Discipline 844 9 Combined arts Visual arts NaN NaN NaN
Funding Band 844 4 NaN NaN NaN NaN NaN
Funding Source (GIA or LOT) 844 2 GIA LOT NaN NaN NaN
Local Authority 844 184 Allerdale York NaN NaN NaN
Notes 37 16 NaN NaN NaN NaN NaN
ONS Region 844 10 East Midlands Yorkshire and The Humber NaN NaN NaN
Portfolio funded in 2015-18? 844 2 No Yes NaN NaN NaN
Portfolio funding 15/16 - £\n 844 570 0 2.4772e+07 125389.350598 24772000.0 2.548629e+05
Portfolio funding 16/17 - £\n 844 570 0 2.4772e+07 125389.350598 24772000.0 2.548629e+05
Portfolio funding 17/18 - £\n 844 570 0 2.4772e+07 125389.350598 24772000.0 2.548629e+05
Portfolio grant 18/19 - £ 844 632 40000 24028840 181022.000000 23988840.0 2.613698e+05
Portfolio grant 19/20 - £ 844 632 40000 24028840 181022.000000 23988840.0 2.613698e+05
Portfolio grant 20/21 - £ 844 632 40000 24028840 181022.000000 23988840.0 2.613698e+05
Portfolio grant 21/22 - £ 844 632 40000 24028840 181022.000000 23988840.0 2.613698e+05
TOTAL Portfolio funding 15/18 - £\n 844 570 0 7.4316e+07 376168.051793 74316000.0 7.645886e+05
TOTAL Portfolio grant 18/22 - £ 844 632 160000 96115360 724088.000000 95955360.0 1.045479e+06
Web address 844 829 - https://www.unicorntheatre.com/ NaN NaN NaN
index 844 844 0 843 421.500000 843.0 4.215000e+02

In [5]:
"""
    Tidy and enrich the original dataset
"""

# Strip out carriage returns in column names
df = df.rename(columns={col: col.replace('\n','').replace('  ',' ') for col in df.columns})
df.df_name = 'All'

# Strip out text value 'NEW' so column is all numbers
df.loc[df['% Cash change between 17/18 and 18/19'].str.contains('NEW', na=False), '% Cash change between 17/18 and 18/19'] = np.NaN

# Make these columns categories
catCols = ['Funding Band', 'Area', 'ACE Region', 'ONS Region', 'Discipline', 'Funding Source (GIA or LOT)']
df[catCols] = df[catCols].apply(pd.Categorical)

# Make this column true or false (Bool)
df.loc[df['Portfolio funded in 2015-18?'].str.contains('Yes', na=False), 'Portfolio funded in 2015-18?'] = True
df.loc[df['Portfolio funded in 2015-18?'].str.contains('No', na=False), 'Portfolio funded in 2015-18?'] = False
df['Portfolio funded in 2015-18?'] = df['Portfolio funded in 2015-18?'].astype('bool')

# Make these columns ints
intCols = ['Portfolio funding 15/16 - £', 'Portfolio funding 16/17 - £', 'Portfolio funding 17/18 - £', 
           'TOTAL Portfolio funding 15/18 - £']
df[intCols] = df[intCols].astype('int')

# Enrich: Funding Source (GIA or LOT) using descriptions here: https://www.artsprofessional.co.uk/news/npo-scheme-merge-ace-grant-aid-and-lottery-funding
df['Funding Source Description'] = df['Funding Source (GIA or LOT)'].astype(str)
df.loc[df['Funding Source Description'].str.contains('GIA', na=False), 'Funding Source Description'] = 'Grant in Aid'
df.loc[df['Funding Source Description'].str.contains('LOT', na=False), 'Funding Source Description'] = 'National Lottery'

# Enrich: Funding Bands using descriptions here: http://www.artscouncil.org.uk/sites/default/files/download-file/Briefing_current_NPOs_MPMs_FINAL.pptx
df['Funding Band Description'] = df['Funding Band'].astype(str)
df.loc[~df['Funding Band Description'].str.contains('SSO'), 'Funding Band Description'] = 'Band ' + df['Funding Band Description'].astype(str)
df.loc[df['Funding Band Description'].str.contains('SSO'), 'Funding Band Description'] = 'Sector Support Organisation'

# Pull information out of notes: Bridge Organisation, Uplift, Reduction, Museum Development
df['Bridge Organisation'] = False
df['Technical Uplift'] = False
df['Technical Reduction'] = False
df['Museum Development'] = False
df.loc[df['Notes'].str.lower().str.contains('bridge organisation', na=False), 'Bridge Organisation'] = True
df.loc[df['Notes'].str.lower().str.contains('uplift', na=False), 'Technical Uplift'] = True
df.loc[df['Notes'].str.lower().str.contains('reduction', na=False), 'Technical Reduction'] = True
df.loc[df['Notes'].str.lower().str.contains('museum development', na=False), 'Museum Development'] = True

# Add Index (Looks like Applicant Name & Funding Band are unique)
df.set_index(['Applicant Name', 'Funding Band'], inplace=True)

"""
    Split into Applicant, Classification, Portfolio and save as csv
"""

dfApplicant = df[['Alternative Name', 'Web address', 'Discipline', 'Notes', 'Area', 'ACE Region', 'ONS Region', 'Local Authority']]
dfApplicant.df_name = 'Applicant'

dfClassification = df[['Funding Source (GIA or LOT)', 'Funding Source Description', 'Funding Band Description', 'Portfolio funded in 2015-18?', 'Bridge Organisation', 'Technical Uplift', 'Technical Reduction', 'Museum Development']]
dfClassification.df_name = 'Classification'

dfPortfolio = portfolioTransform(df,'Portfolio funding 15/16 - £','2015','2015-2018','Fund')
dfPortfolio = dfPortfolio.append(portfolioTransform(df,'Portfolio funding 16/17 - £','2016','2015-2018','Fund'))
dfPortfolio = dfPortfolio.append(portfolioTransform(df,'Portfolio funding 17/18 - £','2017','2015-2018','Fund'))
dfPortfolio = dfPortfolio.append(portfolioTransform(df,'Portfolio grant 18/19 - £','2018','2018-2022','Grant'))
dfPortfolio = dfPortfolio.append(portfolioTransform(df,'Portfolio grant 19/20 - £','2019','2018-2022','Grant'))
dfPortfolio = dfPortfolio.append(portfolioTransform(df,'Portfolio grant 20/21 - £','2020','2018-2022','Grant'))
dfPortfolio = dfPortfolio.append(portfolioTransform(df,'Portfolio grant 21/22 - £','2021','2018-2022','Grant'))

# Make Period, Type categories then add Year to index
catCols = ['Year','Period','Type']
dfPortfolio[catCols] = dfPortfolio[catCols].apply(pd.Categorical)
dfPortfolio.set_index('Year', append=True, inplace=True)
dfPortfolio.df_name = 'Portfolio'

# Save csv
filePrefix = 'NPO_2018-22_'
df.to_csv(os.path.join(dataFolderOut,filePrefix+'all.csv'))
dfApplicant.to_csv(os.path.join(dataFolderOut,filePrefix+'applicant.csv'))
dfPortfolio.to_csv(os.path.join(dataFolderOut,filePrefix+'portfolio.csv'))
dfClassification.to_csv(os.path.join(dataFolderOut,filePrefix+'classification.csv'))

In [6]:
"""
    Get summary information
"""
dfInfo, infoText = customInfo(df)
print(infoText)
dfInfo, infoText = customInfo(dfApplicant)
print(infoText)
dfInfo, infoText = customInfo(dfPortfolio)
print(infoText)
dfInfo, infoText = customInfo(dfClassification)
print(infoText)


The dataset 'All' has 26 columns and 844 rows. Columns: 5 bool, 5 category, 9 int64, 7 object.
The dataset 'Applicant' has 8 columns and 844 rows. Columns: 4 category, 4 object.
The dataset 'Portfolio' has 3 columns and 5908 rows. Columns: 2 category, 1 int64.
The dataset 'Classification' has 8 columns and 844 rows. Columns: 5 bool, 1 category, 2 object.

In [7]:
"""
    Landscape: Check for missing values using MissingNo
    https://github.com/ResidentMario/missingno
"""

msno.matrix(df)
msno.heatmap(df)
msno.bar(df)
msno.dendrogram(df)



In [8]:
print('\n', 'Applicant')
lstDesc = describeCategory(dfApplicant)
for dfDesc in lstDesc:
    print('\n',dfDesc.df_name)
    display(HTML(tabulate(dfDesc,headers='keys', tablefmt='html', showindex=False, )))
    
print('\n', 'Portfolio')
lstDesc = describeCategory(dfPortfolio)
for dfDesc in lstDesc:
    print('\n',dfDesc.df_name)
    display(HTML(tabulate(dfDesc,headers='keys', tablefmt='html', showindex=False, )))    

print('\n', 'Classification')
lstDesc = describeCategory(dfClassification)
for dfDesc in lstDesc:
    print('\n',dfDesc.df_name)
    display(HTML(tabulate(dfDesc,headers='keys', tablefmt='html', showindex=False, )))


 Applicant

 Funding Band
Funding Band Count %
1 5300.627962
2 1900.225118
3 660.0781991
SSO 580.0687204
 Discipline
Discipline Count %
Theatre 1900.225118
Combined arts 1870.221564
Visual arts 1490.17654
Music 1020.120853
Museums 720.0853081
Dance 640.0758294
Literature 490.0580569
Not discipline specific 240.028436
Libraries 70.00829384
 Area
Area Count %
London 2530.299763
North 2250.266588
Midlands 1290.152844
South West 1030.122038
South East 1000.118483
National 340.0402844
 ACE Region
ACE Region Count %
London 2680.317536
South West 1040.123223
North West 940.111374
Yorkshire 930.11019
West Midlands 760.0900474
South East 570.0675355
East Midlands 560.0663507
North East 480.056872
East 480.056872
 ONS Region
ONS Region Count %
London 2680.317536
North West 940.111374
Yorkshire and The Humber 930.11019
South West 860.101896
South East 740.0876777
West Midlands 730.0864929
East Midlands 560.0663507
North East 480.056872
East of England 480.056872
WALES 40.00473934
 Portfolio

 Funding Band
Funding Band Count %
1 37100.627962
2 13300.225118
3 4620.0781991
SSO 4060.0687204
 Year
Year Count %
2021 8440.142857
2020 8440.142857
2019 8440.142857
2018 8440.142857
2017 8440.142857
2016 8440.142857
2015 8440.142857
 Period
Period Count %
2018-2022 33760.571429
2015-2018 25320.428571
 Type
Type Count %
Grant 33760.571429
Fund 25320.428571
 Classification

 Funding Band
Funding Band Count %
1 5300.627962
2 1900.225118
3 660.0781991
SSO 580.0687204
 Funding Source (GIA or LOT)
Funding Source (GIA or LOT) Count %
GIA 6940.822275
LOT 1500.177725

In [9]:
print('\n', 'Portfolio')
lstDesc = describeCategory(dfPortfolio,'Amount')
for dfDesc in lstDesc:
    print('\n',dfDesc.df_name)
    display(HTML(tabulate(dfDesc,headers='keys', tablefmt='html', showindex=False, )))


 Portfolio

 Funding Band
Funding Band Amount %
1 4229290450.155876
2 6334248390.233457
3 15207345440.560485
SSO 1361571700.0501824
 Year
Year Amount %
20153600139860.132688
20163600139860.132688
20173601539860.132739
20184076382760.15024
20194085686780.150583
20204083882760.150517
20214084684100.150546
 Period
Period Amount %
2015-201810801819580.398114
2018-202216330636400.601886
 Type
Type Amount %
Fund 10801819580.398114
Grant 16330636400.601886

In [ ]: